DB2 Jupyter Notebook Extensions Tutorial

The SQL code tutorials for DB2 rely on a Jupyter notebook extension, commonly refer to as a "magic" command. The beginning of all of the notebooks begin with the following command which will load the extension and allow the remainder of the notebook to use the %sql magic command.

%run db2.ipynb
The cell below will load the db2 extension. Note that it will take a few seconds for the extension to load, so you should generally wait until the "DB2 Extensions Loaded" message is displayed in your notebook.


In [ ]:
%run db2.ipynb

Connections to DB2

Before any SQL commands can be issued, a connection needs to be made to the DB2 database that you will be using. The connection can be done manually (through the use of the CONNECT command), or automatically when the first %sql command is issued.

The DB2 magic command tracks whether or not a connection has occured in the past and saves this information between notebooks and sessions. When you start up a notebook and issue a command, the program will reconnect to the database using your credentials from the last session. In the event that you have not connected before, the system will prompt you for all the information it needs to connect. This information includes:

  • Database name (SAMPLE)
  • Hostname - localhost (enter an IP address if you need to connect to a remote server)
  • PORT - 50000 (this is the default but it could be different)
  • Userid - DB2INST1
  • Password - No password is provided so you have to enter a value
  • Maximum Rows - 10 lines of output are displayed when a result set is returned

There will be default values presented in the panels that you can accept, or enter your own values. All of the information will be stored in the directory that the notebooks are stored on. Once you have entered the information, the system will attempt to connect to the database for you and then you can run all of the SQL scripts. More details on the CONNECT syntax will be found in a section below.

The next statement will force a CONNECT to occur with the default values. If you have not connected before, it will prompt you for the information.


In [ ]:
%sql CONNECT

Line versus Cell Command

The DB2 extension is made up of one magic command that works either at the LINE level (%sql) or at the CELL level (%%sql). If you only want to execute a SQL command on one line in your script, use the %sql form of the command. If you want to run a larger block of SQL, then use the %%sql form. Note that when you use the %%sql form of the command, the entire contents of the cell is considered part of the command, so you cannot mix other commands in the cell.

The following is an example of a line command:


In [ ]:
%sql VALUES 'HELLO THERE'

The %sql syntax allows you to pass local variables to the script. There are 5 predefined variables defined in the program:

  • db2_database - The name of the database you are connected to
  • db2_uid - The userid that you connected with
  • db2_host = The IP address of the host system
  • db2_port - The port number of the host system
  • db2_max - The maximum number of rows to return in an answer set

To pass a value to a LINE script, use the braces {} to surround the name of the variable:

   {db2_database}

The next line will display the currently connected database.


In [ ]:
%sql VALUES '{db2_database}'

You cannot use variable substitution with the CELL version of the %sql command. If your SQL statement extends beyond one line, and you want to use variable substitution, you can use a couple of techniques to make it look like one line. The simplest way is to add the backslash character (\) at the end of every line. The following example illustrates the technique.


In [ ]:
%sql VALUES \
  '{db2_database}'

If you have SQL that requires multiple lines, of if you need to execute many lines of SQL, then you should be using the CELL version of the %sql command. To start a block of SQL, start the cell with %%sql and do not place any SQL following the command. Subsequent lines can contain SQL code, with each SQL statement delimited with the semicolon (;). You can change the delimiter if required for procedures, etc... More details on this later.


In [ ]:
%%sql
VALUES
  1,
  2,
  3

If you are using a single statement then there is no need to use a delimiter. However, if you are combining a number of commands then you must use the semicolon.


In [ ]:
%%sql
DROP TABLE STUFF;
CREATE TABLE STUFF (A INT);
INSERT INTO STUFF VALUES
  1,2,3;
SELECT * FROM STUFF;

The script will generate messages and output as it executes. Each SQL statement that generates results will have a table displayed with the result set. If a command is executed, the results of the execution get listed as well. The script you just ran probably generated an error on the DROP table command.

Options

Both forms of the %sql command have options that can be used to change the behavior of the code. For both forms of the command (%sql, %%sql), the options must be on the same line as the command:

%sql -t ...
%%sql -t

The only difference is that the %sql command can have SQL following the parameters, while the %%sql requires the SQL to be placed on subsequent lines.

There are a number of parameters that you can specify as part of the %sql statement.

  • -d - Use alternative delimiter
  • -t - Time the statement execution
  • -n - Run all statements as commands (no answer sets)
  • -s - Run all statements as SQL
  • -q - Suppress messages
  • -qq - Suppress messages and any SQL output
  • -j - JSON formatting of a column
  • -a - Show all output
  • -pb - Bar chart of results
  • -pp - Pie chart of results
  • -pl - Line chart of results
  • -sampledata Load the database with the sample EMPLOYEE and DEPARTMENT tables
  • -r - Return the results into a variable (list of rows)

Multiple parameters are allowed on a command line. Each option should be separated by a space:

%sql -a -j ...

The sections below will explain the options in more detail.

Delimiters

The default delimiter for all SQL statements is the semicolon. However, this becomes a problem when you try to create a trigger, function, or procedure that uses SQLPL (or PL/SQL). Use the -d option to turn the SQL delimiter into the at (@) sign. The semi-colon is then ignored as a delimiter.

For example, the following SQL will use the @ sign as the delimiter.


In [ ]:
%%sql -d
DROP TABLE STUFF
@
CREATE TABLE STUFF (A INT)
@
INSERT INTO STUFF VALUES
  1,2,3
@
SELECT * FROM STUFF
@

The delimiter change will only take place for the statements following the %%sql command. Subsequent cells in the notebook will still use the semicolon. You must use the -d option for every cell that needs to use the semicolon in the script.

Limiting Result Sets

The default number of rows displayed for any result set is 10. You have the option of changing this option when initially connecting to the database. If you want to override the number of rows display you can either update the control variable, or use the -a option. The -a option will display all of the rows in the answer set. For instance, the following SQL will only show 10 rows even though we inserted 15 values:


In [ ]:
%sql values 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15

You will notice that the displayed result will split the visible rows to the first 5 rows and the last 5 rows. Using the -a option will display all values:


In [ ]:
%sql -a values 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15

To change the default value of rows displayed, you can either do a CONNECT RESET (discussed later) or set the db2 control variable db2_max to a different value. A value of -1 will display all rows.


In [ ]:
# Save previous version of maximum rows
last_max = db2_max
db2_max = 5
%sql values 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
# Set the maximum back
db2_max = last_max

Quiet Mode

Every SQL statement will result in some output. You will either get an answer set (SELECT), or an indication if the command worked. For instance, the following set of SQL will generate some error messages since the tables will probably not exist:


In [ ]:
%%sql
DROP TABLE TABLE_NOT_FOUND;
DROP TABLE TABLE_SPELLED_WRONG;

If you know that these errors may occur you can silence them with the -q option.


In [ ]:
%%sql -q
DROP TABLE TABLE_NOT_FOUND;
DROP TABLE TABLE_SPELLED_WRONG;

SQL output will not be suppressed, so the following command will still show the results.


In [ ]:
%%sql -q
DROP TABLE TABLE_NOT_FOUND;
DROP TABLE TABLE_SPELLED_WRONG;
VALUES 1,2,3;

To have the messages returned as text only, you must set the db2_error_highlight variable to False. This is a change that will affect all messages in the notebook.


In [ ]:
db2_error_highlight = False
%sql DROP TABLE TABLE_NOT_FOUND;

To set the messages back to being formatted, set db2_error_highlight to True.


In [ ]:
db2_error_highlight = True
%sql DROP TABLE TABLE_NOT_FOUND;

SQL and Command Mode

The %sql looks at the beginning of your SQL to determine whether or not to run it as a SELECT statement, or as a command. There are three statements that trigger SELECT mode: SELECT, WITH, and VALUES. Aside from CONNECT, all other statements will be considered to be an SQL statement that does not return a result set. In most cases this will work fine. However, it is possible that you may have a statement that is actually a command and does not return a result set. However, because of the simplistic nature of determining the statement type, the statement will not be executed properly. To treat all statements as commands, use the -n flag (no output) and the -s flag for output.

Timing SQL Statements

Sometimes you want to see how the execution of a statement changes with the addition of indexes or other optimization changes. The -t option will run the statement on the LINE or one SQL statement in the CELL for exactly one second. The results will be displayed and optionally placed into a variable. The syntax of the command is:

sql_time = %sql -t SELECT * FROM EMPLOYEE
For instance, the following SQL will time the VALUES clause.


In [ ]:
%sql -t VALUES 1,2,3,4,5,6,7,8,9

When timing a statement, no output will be displayed. If your SQL statement takes longer than one second you will need to modify the db2_runtime variable. This variable must be set to the number of seconds that you want to run the statement.


In [ ]:
db2_runtime = 5
%sql -t VALUES 1,2,3,4,5,6,7,8,9

JSON Formatting

DB2 supports querying JSON that is stored in a column within a table. Standard output would just display the JSON as a string. For instance, the following statement would just return a large string of output.


In [ ]:
%%sql
VALUES 
      '{
      "empno":"000010",
      "firstnme":"CHRISTINE",
      "midinit":"I",
      "lastname":"HAAS",
      "workdept":"A00",
      "phoneno":[3978],
      "hiredate":"01/01/1995",
      "job":"PRES",
      "edlevel":18,
      "sex":"F",
      "birthdate":"08/24/1963",
      "pay" : {
        "salary":152750.00,
        "bonus":1000.00,
        "comm":4220.00}
      }'

Adding the -j option to the %sql (or %%sql) command will format the first column of a return set to better display the structure of the document. Note that if your answer set has additional columns associated with it, they will not be displayed in this format.


In [ ]:
%%sql -j
VALUES 
      '{
      "empno":"000010",
      "firstnme":"CHRISTINE",
      "midinit":"I",
      "lastname":"HAAS",
      "workdept":"A00",
      "phoneno":[3978],
      "hiredate":"01/01/1995",
      "job":"PRES",
      "edlevel":18,
      "sex":"F",
      "birthdate":"08/24/1963",
      "pay" : {
        "salary":152750.00,
        "bonus":1000.00,
        "comm":4220.00}
      }'

Plotting

Sometimes it would be useful to display a result set as either a bar, pie, or line chart. The first one or two columns of a result set need to contain the values need to plot the information.

The three possible plot options are:

  • -pb - bar chart (x,y)
  • -pp - pie chart (y)
  • -pl - line chart (x,y)

The following data will be used to demonstrate the different charting options.


In [ ]:
%sql values 1,2,3,4,5

Since the results only have one column, the pie, line, and bar charts will not have any labels associated with them. The first example is a bar chart.


In [ ]:
%sql -pb values 1,2,3,4,5

The same data as a pie chart.


In [ ]:
%sql -pp values 1,2,3,4,5

And finally a line chart.


In [ ]:
%sql -pl values 1,2,3,4,5

If you retrieve two columns of information, the first column is used for the labels (X axis or pie slices) and the second column contains the data.


In [ ]:
%sql -pb values ('A',1),('B',2),('C',3),('D',4),('E',5)

For a pie chart, the first column is used to label the slices, while the data comes from the second column.


In [ ]:
%sql -pp values ('A',1),('B',2),('C',3),('D',4),('E',5)

Finally, for a line chart, the x contains the labels and the y values are used.


In [ ]:
%sql -pl values ('A',1),('B',2),('C',3),('D',4),('E',5)

The following SQL will plot the number of employees per department.


In [ ]:
%%sql -pb
SELECT WORKDEPT, COUNT(*) 
  FROM EMPLOYEE
GROUP BY WORKDEPT

Sample Data

Many of the DB2 notebooks depend on two of the tables that are found in the SAMPLE database. Rather than having to create the entire SAMPLE database, this option will create and populate the EMPLOYEE and DEPARTMENT tables in your database. Note that if you already have these tables defined, they will not be dropped.


In [ ]:
%sql -sampledata

Result Set

By default, any %sql block will return the contents of a result set as a table that is displayed in the notebook. If you want to capture the results from the SQL into a variable, you would use the -r option:

var = %sql -r select * from employee
The result is a list of rows. Each row is a list itself. The rows and columns all start at zero (0), so to access the first column of the first row, you would use var[0][0] to access it.


In [ ]:
rows = %sql -r select * from employee
print(rows[0][0])

The number of rows in the result set can be determined by using the length function.


In [ ]:
print(len(rows))

If you want to iterate over all of the rows and columns, you could use the following Python syntax instead of creating a for loop that goes from 0 to 41.


In [ ]:
for row in rows:
    line = ""
    for col in row:
        line = line + str(col) + ","
    print(line)

Since the data may be returned in different formats (like integers), you should use the str() function to convert the values to strings. Otherwise, the concatenation function used in the above example will fail. For instance, the 6th field is a birthdate field. If you retrieve it as an individual value and try and concatenate a string to it, you get the following error.


In [ ]:
print("Birth Date="+rows[0][6])

You can fix this problem by adding the str function to convert the date.


In [ ]:
print("Birth Date="+str(rows[0][6]))

DB2 CONNECT Statement

As mentioned at the beginning of this notebook, connecting to DB2 is automatically done when you issue your first %sql statement. Usually the program will prompt you with what options you want when connecting to a database. The other option is to use the CONNECT statement directly. The CONNECT statement is similar to the native DB2 CONNECT command, but includes some options that allow you to connect to databases that has not been catalogued locally.

The CONNECT command has the following format:

%sql CONNECT TO <database> USER <userid> USING <password | ?> HOST <ip address> PORT <port number>
If you use a "?" for the password field, the system will prompt you for a password. This avoids typing the password as clear text on the screen. If a connection is not successful, the system will print the error message associated with the connect request.

If the connection is successful, the parameters are saved on your system and will be used the next time you run a SQL statement, or when you issue the %sql CONNECT command with no parameters.

If you want to force the program to connect to a different database (with prompting), use the CONNECT RESET command. The next time you run a SQL statement, the program will prompt you for the the connection and will force the program to reconnect the next time a SQL statement is executed.


In [ ]:
%sql CONNECT RESET

In [ ]:
%sql CONNECT

Credits: IBM 2017, George Baklarz [baklarz@ca.ibm.com]